Concentration rate and consistency of the 
posterior under monotonicity constraints 



Abstract: In this paper, we consider the well known problem of estimating
a density function under qualitative assumptions. More precisely, we
estimate monotone non increasing densities in a Bayesian setting and derive
concentration rates for the posterior distribution for a Dirichlet process
prior and a finite mixture prior. We prove that the posterior distribution based on
both priors concentrates at the rate $(n/\log(n))^{-1/3}$, which is the minimax
rate of estimation up to a $\log(n)$ factor. We also study the behaviour of
the posterior for the pointwise loss at any fixed point of the support of the
density and for the sup norm. We prove that the posterior is consistent and
achieves the optimal concentration rate for both losses.
1. Introduction

The nonparametric problem of estimating a monotone curve, and monotone densities
in particular, has been well studied in the literature from both a theoretical
and an applied perspective. Shape constrained estimation is fairly popular in the
nonparametric literature and widely used in practice (see Robertson et al., 1988,
for instance). Monotone densities appear in a wide variety of applications such
as survival analysis, where it is natural to assume that the uncensored survival
time has a monotone non increasing density. In these problems, estimating the
survival function is equivalent to estimating the survival time density, say $f$, and
the pointwise value $f(0)$. It is thus interesting to have a better understanding
of the behaviour of the estimation procedures in this case. An interesting property
of monotone non increasing densities on $\mathbb{R}^+$ is that they have a mixture
representation pointed out by Williamson (1956)

$$f(x) = \int_0^\infty \frac{\mathbb{I}_{[0,\theta]}(x)}{\theta}\, dP(\theta), \tag{1}$$

where $P$ is a probability distribution on $\mathbb{R}^+$ called the mixing distribution. In
order to emphasize the dependence on $P$, we will denote by $f_P$ the functions
admitting representation (1). This representation allows for inference based on the
likelihood. Grenander (1956) derived the nonparametric maximum likelihood
estimator of a monotone density and Prakasa Rao (1970) studied the behaviour
of the Grenander estimator at a fixed point. Groeneboom (1985) and, more
recently, Balabdaoui and Wellner (2007) studied very precisely the asymptotic
properties of the nonparametric maximum likelihood estimator. It is proved to
be consistent and to converge at the minimax rate $n^{-1/3}$ when the support of
the distribution is compact. In their paper, Durot et al. (2011) obtain some refined
asymptotic results for the supremum norm.
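
As an illustration of the mixture representation (1), the following minimal Python sketch evaluates $f_P$ when the mixing distribution is discrete. The function name `mixture_density` and the atoms and weights are hypothetical choices made for this example, not quantities taken from the paper.

```python
def mixture_density(x, atoms, weights):
    """Evaluate f_P(x) = sum_j p_j * 1{0 <= x <= theta_j} / theta_j for a discrete P."""
    return sum(p / theta for theta, p in zip(atoms, weights) if 0.0 <= x <= theta)


# Hypothetical mixing distribution: masses p_j at points theta_j > 0.
atoms = [0.5, 1.0, 2.0]
weights = [0.2, 0.3, 0.5]

# Values on a grid of [0, 2]; a mixture of uniforms on [0, theta_j] is non increasing.
grid = [0.01 * k for k in range(201)]
values = [mixture_density(x, atoms, weights) for x in grid]
```

Each uniform component on $[0, \theta_j]$ integrates to one, so $f_P$ integrates to $\sum_j p_j = 1$, and the density drops by $p_j/\theta_j$ each time $x$ passes an atom $\theta_j$.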

From a Bayesian point of view, frequentist results such as consistency or
concentration rates are also essential. The Bayesian approach to nonparametric
problems requires the construction of priors on an infinite dimensional
space. This cannot be done in a completely subjective way. Thus, as argued in
Diaconis and Freedman (1986), strong consistency of the posterior distribution
is a key issue in nonparametric Bayesian statistics. Studying concentration rates
of the posterior provides more refined results that lead to a better understanding
of the behaviour of the posterior. It also allows the comparison between
Bayesian and frequentist procedures.

The mixture representation of monotone densities leads naturally to a mixture
type prior on the set of monotone non increasing densities with support
on $[0,L]$ or $\mathbb{R}^+$. For example, Ferguson (1983) and Lo (1984) introduced the
Dirichlet Process prior (DP) and Brunner and Lo (1989) considered the special
case of unimodal densities with a prior based on a Dirichlet Process mixture.
The problem of deriving concentration rates for mixture models has received
considerable interest in the past decade. Wu and Ghosal (2008) studied properties
of general mixture models, Ghosal and van der Vaart (2001) studied the well
known problem of Gaussian mixtures, Rousseau (2010) derived concentration
rates for mixtures of betas, and Kruijer et al. (2009) proved good adaptive properties
of mixtures of Gaussians. Extensions to the multivariate case have recently
been introduced (e.g. Shen and Ghosal (2011); Tokdar (2011)).

Under monotonicity constraints, we derive an upper bound for the posterior
concentration rate with respect to some distance $d(\cdot,\cdot)$, that is a positive
sequence $(\epsilon_n)_n$ that goes to $0$ when $n$ goes to infinity such that

$$E_0^n\left(\Pi\left(d(f,f_0) > \epsilon_n \mid X^n\right)\right) \to 0$$

where the expectation is taken under the true distribution $P_0$ of the data $X^n$ and
where $f_0$ is the density of $P_0$ with respect to the Lebesgue measure. Following
Khazaei et al. (2010) we study two families of nonparametric priors on the class
of monotone non increasing densities. Interestingly in our setting, the so called
Kullback-Leibler property, that is the fact that the prior puts enough mass
on Kullback-Leibler neighbourhoods of the true density, is not satisfied. Thus
the approach based on the seminal paper of Ghosal et al. (2000) cannot be
applied. We therefore use a modified version of their results and obtain for
the two families of priors a concentration rate of order $(n/\log(n))^{-1/3}$, which is
the minimax estimation rate up to a $\log(n)$ factor. We extend these results to
densities with support on $\mathbb{R}^+$ and prove that under some conditions on the tail
of the distribution, the posterior still concentrates at an almost optimal rate. To
the author's knowledge, no concentration rates have been derived for monotone
densities on $\mathbb{R}^+$.

Interestingly, the nonparametric maximum likelihood estimator of $f_0(x)$ is
not consistent for $x = 0$ (see Sun and Woodroofe (1996) and Balabdaoui and Wellner
(2007) for instance). However, we prove that the posterior distribution of $f_0$ is
still consistent at this point. In fact we prove the pointwise consistency of the
posterior for all $x$ in $[0,L]$ with $L < \infty$. We then derive a consistent Bayesian
estimator of the density at any fixed point of the support. This is particularly
interesting as the pointwise loss is usually difficult to study in a Bayesian
framework, since Bayesian approaches are well suited to losses related to the
Kullback-Leibler divergence. We also study the behaviour of the posterior
distribution for the sup norm. This problem has been addressed recently in the
frequentist literature by Durot et al. (2011). They derive refined asymptotic
results on the sup norm of the difference between a Grenander-type estimator and
the true density on sub intervals of the form $[\epsilon, L-\epsilon]$ where $\epsilon > 0$, avoiding the
problems at the boundaries. Here, we prove that the posterior distribution is
consistent in sup norm on the whole support of $f_0$. We also derive concentration
rates for the posterior of the density taken at a fixed point and for the sup norm
on subsets of $[0,L]$. Surprisingly, the concentration rates are not deteriorated
when considering such losses. It seems that monotone non increasing densities
have the same behaviour as 1-Hölderian densities in terms of concentration
rates. However, Giné and Nickl (2012) obtained sub-optimal concentration rates
in sup norm under a Hölder condition on the true density. Moreover, Arbel et al.
(2012) proved that a specific prior in the white noise model which leads to minimax
posterior concentration rates was sub-optimal under the pointwise loss and
thus under the sup norm. Therefore in this respect, shape constraints such as
monotonicity imply simpler behaviour since we obtain the minimax concentration
rate also under the pointwise loss and the sup norm.

We now introduce some notations which will be needed throughout the paper. 



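Notations. For $0 < L \le \infty$ define the set $\mathcal{F}_L$ by
$$\mathcal{F}_L = \left\{ f \ \text{s.t.}\ 0 \le f < \infty,\ f \searrow,\ \int_0^L f = 1 \right\}.$$

We also define $\mathcal{S}_K$ the $K$-simplex, that is the set $\{(s_1,\dots,s_K) \in [0,1]^K, \sum_{i=1}^K s_i = 1\}$.
Let $KL(p_1,p_2)$ be the Kullback-Leibler divergence between the densities $p_1$
and $p_2$ with respect to some measure $\lambda$

$$KL(p_1,p_2) = \int \log\frac{p_1}{p_2}\, p_1\, d\lambda.$$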
We also define the Hellinger distance $h(p_1,p_2)$ between $p_1$ and $p_2$ as

$$h^2(p_1,p_2) = \int \left(\sqrt{p_1} - \sqrt{p_2}\right)^2 d\lambda.$$
Finally, we will say that $S_n = o_{P_0}(1)$ if $S_n \to 0$ in probability under $P_0$.

Construction of a prior distribution on $\mathcal{F}_L$. Using the mixture representation
of monotone non increasing densities (1) we construct nonparametric priors
on the set $\mathcal{F}_L$ by considering a prior on the mixing distribution $P$. Let $\mathcal{P}$ be
the set of probability measures on $[0,L]$. We thus fall in the well known set up
of nonparametric mixture prior models. We consider two types of prior on the
set $\mathcal{P}$.

Type 1 : Dirichlet Process prior $P \sim DP(A,\alpha)$ where $A$ is a positive constant
and $\alpha$ a probability distribution on $[0,L]$.

Type 2 : Finite mixture $P = \sum_{j=1}^K p_j \delta_{x_j}$ with $K$ a non zero integer and
$\delta_x$ the Dirac mass at $x$. We choose a prior distribution $Q$ on $K$ and,
given $K$, define distributions $\pi_{x,K}$ on $(x_1,\dots,x_K) \in [0,L]^K$ and $\pi_{p,K}$ on
$(p_1,\dots,p_K) \in \mathcal{S}_K$.

For $X^n = (X_1,\dots,X_n)$, a sample of $n$ independent and identically distributed
random variables with common probability density function $f$ in $\mathcal{F}_L$ with
respect to the Lebesgue measure, we denote by $\Pi(\cdot \mid X^n)$ the posterior probability
measure associated with the prior $\Pi$.
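
To make the two prior types concrete, here is a minimal Python sketch that draws one mixing distribution from each. Several choices are assumptions made only for illustration and are not prescribed by the paper: the Dirichlet process is approximated by truncated stick-breaking, the base density is taken proportional to $\theta^t$ on $[0,L]$, the number of components $K$ is fixed rather than drawn from $Q$, and the weights of the finite mixture are uniform on the simplex.

```python
import random


def sample_base(rng, L, t):
    """Draw theta from the density (t + 1) * theta**t / L**(t + 1) on (0, L] by inversion."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so theta is never exactly 0
    return L * u ** (1.0 / (t + 1.0))


def sample_type1(rng, A, L, t, truncation=200):
    """Approximate draw from DP(A, alpha) by truncated stick-breaking."""
    atoms, weights, remaining = [], [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, A)  # stick-breaking proportion
        atoms.append(sample_base(rng, L, t))
        weights.append(remaining * v)
        remaining *= 1.0 - v
    atoms.append(sample_base(rng, L, t))
    weights.append(remaining)  # leftover mass goes to the last atom
    return atoms, weights


def sample_type2(rng, K, L, t):
    """Finite mixture: K ordered atoms from alpha, weights uniform on the simplex."""
    atoms = sorted(sample_base(rng, L, t) for _ in range(K))
    e = [rng.expovariate(1.0) for _ in range(K)]
    total = sum(e)
    return atoms, [v / total for v in e]
```

Either pair `(atoms, weights)` defines a random monotone non increasing density through representation (1).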

The paper is organised as follows: the main results are given in Section 2, where
conditions on the priors are discussed. The proofs are presented in Section 3.

2. Main results 

Concentration rates of the posterior distributions have been well studied in the
literature and some general results link the rate to the prior (see Ghosal et al.
(2000)). However, in our setting, the Kullback-Leibler property is not satisfied
and thus the standard theorems do not hold. We then use a modified version of
the results of Ghosal et al. (2000), considering truncated versions of the density
$f_0$. This idea has been considered in Khazaei et al. (2010) in a similar setting.
The following theorem gives general conditions on priors to achieve a posterior
convergence rate $(n/\log(n))^{-1/3}$ for densities in $\mathcal{F}_L$ with $0 < L < \infty$.

Theorem 1. Let $X^n = (X_1,\dots,X_n)$ be an independent and identically distributed
sample with a common probability density function $f_0$ such that
$f_0 \in \mathcal{F}_L$ with $0 < L < \infty$. Let $\alpha$ be a probability density with respect to the
Lebesgue measure with support included in $\mathbb{R}^+$ such that $\alpha > 0$ on $[0,L]$ and that
satisfies, for $\theta$ sufficiently small and $t > 1$,

$$\alpha(\theta) \lesssim \theta^{t}. \tag{2a}$$

Define also $Q$ a probability distribution on $\mathbb{N}^*$ and $\pi_{p,K}$ a probability distribution
on $\mathcal{S}_K$ satisfying for some positive constants $C_1, C_2, a_1,\dots,a_K, c$

$$e^{-C_1 K\log(K)} \ge Q(K) \ge e^{-C_2 K\log(K)}, \tag{2b}$$

$$\pi_{p,K}(p_1,\dots,p_K) \ge K^{-K} c^{K} p_1^{a_1}\cdots p_K^{a_K}, \tag{2c}$$

and finally, let $(x_i)_{i=1}^K$ be the order statistics of $K$ independent and identically
distributed random variables from $\alpha$. If $d(\cdot,\cdot)$ is either the $L^1$ or Hellinger distance,
then for $\Pi$ a Type 1 or Type 2 prior, there exists a positive constant $C$
such that

$$\Pi\left(f,\ d(f,f_0) \ge C\left(\frac{n}{\log(n)}\right)^{-1/3} \,\Big|\, X^n\right) \to 0, \quad P_0 \text{ a.s.} \tag{3}$$

when $n$ goes to infinity, where $C$ depends on $f_0$ only through $L$ and an upper
bound on $f_0(0)$.

The proof of Theorem 1 is given in Section 3. Conditions (2) are roughly the
same as in Khazaei et al. (2010). Theorem 1 is thus an extension of their results
to concentration rates. Under an additional condition on the tail of the true
density, namely exponential tails, we obtain the posterior concentration
rate for densities with support on $\mathbb{R}^+$.

Theorem 2. Let $X^n = (X_1,\dots,X_n)$ be an independent and identically distributed
sample with a common probability density function $f_0$ such that
$f_0 \in \mathcal{F}_\infty$ and $f_0(x) \le e^{-\beta x^{\tau}}$ for $\beta$ and $\tau$ some positive constants and $x$ large
enough. Let $\Pi$ be as in Theorem 1. Then for some positive constant $C$ we have,
for $d(\cdot,\cdot)$ either the $L^1$ or Hellinger metric,

$$\Pi\left(d(f_P,f_0) \ge C\,(n/\log(n))^{-1/3}\log(n)^{1/\tau} \mid X^n\right) \to 0, \quad P_0 \text{ a.s.} \tag{4}$$

when $n$ goes to infinity.

Note that considering monotone non increasing densities on $\mathbb{R}^+$ deteriorates
the upper bound on the posterior concentration rate by a factor $\log(n)^{1/\tau}$. It
is not clear whether it could be sharpened or not. For instance, in the frequentist
literature, Reynaud-Bouret et al. (2011) observe a slower convergence rate
when considering infinite support for densities without any other conditions. In
a Bayesian setting, a similar log term appears in Kruijer et al. (2009) when
considering densities with non compact support. However, this deterioration of the
concentration rate does not have a great influence on the asymptotic behaviour
of the posterior. Note also that the tail conditions are mild as $\tau$ can be taken as
small as needed, and thus the considered densities can have almost polynomial
tails.
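
To give a sense of the magnitudes involved, the following Python sketch compares the minimax rate $n^{-1/3}$ with the rates appearing in (3) and (4), with all constants set to one. The function name `posterior_rate` and the value $\tau = 1$ are illustrative assumptions.

```python
import math


def posterior_rate(n, tau=None):
    """Rate (n / log n)^(-1/3), times log(n)^(1/tau) when a tail index tau is given."""
    eps = (n / math.log(n)) ** (-1.0 / 3.0)
    if tau is not None:
        eps *= math.log(n) ** (1.0 / tau)
    return eps


for n in (10**3, 10**5, 10**7):
    minimax = n ** (-1.0 / 3.0)
    print(n, minimax, posterior_rate(n), posterior_rate(n, tau=1.0))
```

The rate $\epsilon_n = (n/\log(n))^{-1/3}$ is the solution of $n\epsilon_n^2 = \epsilon_n^{-1}\log(n)$, and it exceeds the minimax rate by the factor $\log(n)^{1/3}$.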

The above results on the posterior concentration rate in terms of the $L^1$ or
Hellinger metric are new to our knowledge but not surprising. The specificity
of these results lies in the fact that the usual approach based on Kullback-Leibler
neighbourhoods to bound the marginal density from below cannot be
used here, as explained in Section 1.

The following results consider the pointwise loss function, for which no results
exist in the Bayesian nonparametric literature, apart from the paper of
Giné and Nickl (2010). In this paper, the authors propose to use a global test
based on a frequentist estimator instead of the finitely many local tests considered
in Ghosal et al. (2000), which lead to the entropy condition (2.2) in their Theorem 2.1.
When applied to Hölderian classes of densities and under the sup norm
loss, this approach induces a suboptimal concentration rate. Interestingly, under
shape constraints, as in our case, this leads to the minimax rate of concentration
(up to a $\log(n)$ term). We first state a consistency result over $[0,L]$.

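Theorem 3. Let $x$ be in $[0,L]$ with $0 < L \le \infty$ but $x < \infty$. Let $f_0 \in \mathcal{F}_L$
be such that $f_0'$ exists near $x$ and $f_0'(x) < 0$. Let $X_i$, $i = 1,\dots,n$ and $\Pi$ be as in
Theorem 1. Then, for all $x$ in $[0,L]$ with $x < \infty$, and all $\epsilon > 0$,

$$\Pi\left(|f_P(x) - f_0(x)| > \epsilon \mid X^n\right) = o_{P_0}(1).$$
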

We thus have pointwise consistency of the posterior distribution of $f_0(x)$
for every $x$ in the support of $f_0$, which also includes the boundary points. Note
that the maximum likelihood estimator is not consistent at the boundaries of the support,
as pointed out in Sun and Woodroofe (1996) for instance. In particular it is not
consistent at $0$ and, when $L < \infty$, it is not consistent at $L$. It is known that
integrating the parameter as done in Bayesian approaches induces a penalisation.
This is particularly useful in testing or model choice problems but can also
be effective in estimation problems, see for instance Rousseau and Mengersen
(2011). The problem of estimating $f_0(0)$ under monotonicity constraints is another
example of the effectiveness of the penalisation induced by integration over
the parameters. However, contrary to Rousseau and Mengersen (2011), we
have not clearly identified how the penalisation acts, and only observe that it
leads to a consistent posterior distribution and a consistent estimator. Some
refined results on the asymptotic distribution of a monotone non increasing
density at a fixed point can be found in Prakasa Rao (1970) and more recently
Balabdaoui and Wellner (2007). However, none of these results holds for all points
in the whole support of $f_0$.

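The following Theorem gives an upper bound for the concentration rate of
the posterior distribution under the pointwise loss.

Theorem 4. Let $f_0$ and $\Pi$ be as in Theorem 3 and let $x$ be in $(0,L)$, then for
$C$ a positive constant

$$\Pi\left(|f_P(x) - f_0(x)| > C\,(n/\log(n))^{-1/3} \mid X^n\right) = o_{P_0}(1).$$

Consider the posterior median $\hat f_n^\pi(x)$ of $f_P(x)$. It follows that, for all $\epsilon > 0$,

$$P_0\left(|\hat f_n^\pi(x) - f_0(x)| > \epsilon\right) \to 0 \tag{6}$$

when $n$ goes to infinity.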
We derive from Theorem 3 the consistency of the posterior distribution for
the sup norm. This is particularly useful when considering confidence bands, as
pointed out in Giné and Nickl (2010). Under similar assumptions as in Durot et al.
(2011), we get the consistency of the posterior distribution for the sup norm.
Note that contrary to Durot et al. (2011), we do not restrict to sub-intervals
of the support of the density. This is mainly due to the fact that the Bayesian
approaches are consistent at the boundaries of the support of $f_0$.

Theorem 5. Let $f_0 \in \mathcal{F}_L$ with $0 < L < \infty$ be such that $f_0'$ exists and $\|f_0'\|_\infty < \infty$
and $f_0'(0^+) < 0$. Let also the prior $\Pi$ be as in Theorem 1. Then, for all $\epsilon > 0$,

$$\Pi\left(\sup_{x\in[0,L]} |f_P(x) - f_0(x)| > \epsilon \,\Big|\, X^n\right) = o_{P_0}(1). \tag{8}$$

As before, we study the concentration rate of the posterior distribution
for the sup norm loss. Durot et al. (2011) prove that the nonparametric
maximum likelihood estimator achieves the optimal rate when restricting the
norm to a compact set strictly included in the support of $f_0$. This is mainly due to
the behaviour of the nonparametric maximum likelihood estimator near the
boundaries, as pointed out in Kulikov and Lopuhaä (2006). Although we could
obtain consistency of the posterior distribution in sup norm over $[0,L]$, contrary
to what happens with the MLE, in the following Theorem we only get a
posterior concentration rate for the sup norm on sub intervals of $[0,L]$. It is not
clear to us that the rate $(n/\log(n))^{-1/3}$ can still be attained for the sup norm
over the whole interval $[0,L]$.

Theorem 6. Let $f_0 \in \mathcal{F}_L$ with $0 < L < \infty$ be such that $f_0'$ exists and $\|f_0'\|_\infty \le C_0' < \infty$
and $f_0'(0^+) < 0$. Let also the prior $\Pi$ be as in Theorem 1. Let $0 < a < b < L$
and $C > 0$ be some fixed constants, then

$$\Pi\left(\sup_{x\in[a,b]} |f_P(x) - f_0(x)| > C\,(n/\log(n))^{-1/3} \,\Big|\, X^n\right) = o_{P_0}(1). \tag{9}$$

3. Proofs

In this section we prove Theorems 1 to 6 given in Section 2. The proofs of
Theorems 1 and 2 are based on a piecewise constant approximation of $f_0$ in the
sense of the Kullback-Leibler divergence, following Khazaei et al. (2010). Note that
our approximation differs from theirs, which was adapted to the $L^1$ distance
but not to the Kullback-Leibler divergence. Our approach is similar to the
construction proposed in the proof of Theorem 2.7.5 of van der Vaart and Wellner
(1996).
3.1. Proof of Theorems 1 and 2

We first give a series of Lemmas used to prove Theorem 1. The proofs of these
Lemmas are postponed to Appendix A. We focus on densities in $\mathcal{F}_L$ with $0 < L < \infty$
and then extend the results to the case $L = \infty$ with exponential tails.

The piecewise constant approximation of $f_0$ is based on a sequential subdivision
of the interval $[0,L]$ with more refined subdivisions where $f_0$ is less regular.
We then identify a finite piecewise constant density with a mixture of uniforms.
The following Lemma gives the form of a finite probability distribution $P$ such
that $f_P$ is in a Kullback-Leibler neighbourhood of $f_0$.

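Lemma 7. Let $f \in \mathcal{F}_L$ be such that $f(0) \le M < +\infty$. For all $0 < \epsilon < 1$
there exist $m \lesssim L^{1/3} M^{1/3} \epsilon^{-1}$, $p = (p_1,\dots,p_m) \in \mathcal{S}_m$ and $x = (x_1,\dots,x_m) \in [0,L]^m$
such that $P = \sum_{i=1}^m \delta_{x_i} p_i$ satisfies

$$KL(f,f_P) \lesssim \epsilon^2, \qquad \int \left(\log\frac{f}{f_P}\right)^2 f \lesssim \epsilon^2, \tag{10}$$

where $f_P$ is defined as in (1).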
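
The identification of a piecewise constant non increasing density with a finite mixture of uniforms can be made explicit. The Python sketch below is an illustration under assumed notation (`breaks` for the jump locations and `heights` for the step values); it is not the construction used in the proof of the Lemma, only the elementary correspondence between steps and mixture weights.

```python
def step_to_mixture(breaks, heights):
    """Mixture weights of a non increasing step density.

    breaks:  0 < x_1 < ... < x_m
    heights: h_1 >= ... >= h_m >= 0; the density equals h_j on (x_{j-1}, x_j]
             (with x_0 = 0) and 0 beyond x_m.
    Returns p_j = x_j * (h_j - h_{j+1}), with h_{m+1} = 0, so that
    sum_j p_j * 1{x <= x_j} / x_j reproduces the step density.
    """
    m = len(breaks)
    weights = []
    for j in range(m):
        next_height = heights[j + 1] if j + 1 < m else 0.0
        weights.append(breaks[j] * (heights[j] - next_height))
    return weights


# Hypothetical step density: 1.5 on (0, 0.5] and 0.5 on (0.5, 1].
breaks = [0.5, 1.0]
heights = [1.5, 0.5]
weights = step_to_mixture(breaks, heights)
```

By summation by parts, $\sum_j p_j = \sum_j h_j (x_j - x_{j-1})$, so the weights sum to one exactly when the step function is a density.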

The proof of this Lemma is postponed to Appendix A. The proof of Theorem
1 is adapted from Theorem 2.1 of Ghosal et al. (2000). It consists in obtaining a
lower bound on the prior mass of Kullback-Leibler neighbourhoods of any density
in $\mathcal{F}_L$. An interesting feature of mixture distributions whose kernels have varying
support is that the prior mass of the set $\{f, KL(f_0,f) = +\infty\}$ is 1 for most
$f_0 \in \mathcal{F}_L$. Hence we cannot apply directly the result of Ghosal et al. (2000). We
thus extend the approach used in Khazaei et al. (2010) to the concentration
rate framework and get results similar to those presented in Ghosal et al. (2000).
Let $f_0$ be in $\mathcal{F}_L$, and define



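$$S_n(\epsilon_n,\theta_n) = \left\{ f,\ KL(f_{0,n},f_n) \le \epsilon_n^2,\ \int f_{0,n}(x)\left(\log\frac{f_{0,n}}{f_n}(x)\right)^2 dx \le \epsilon_n^2,\ \int_0^{\theta_n} f(x)\,dx \ge 1-\epsilon_n^2 \right\} \tag{11}$$

where

$$f_n(\cdot) = \frac{f(\cdot)\,\mathbb{I}_{[0,\theta_n]}(\cdot)}{F(\theta_n)}, \qquad f_{0,n}(\cdot) = \frac{f_0(\cdot)\,\mathbb{I}_{[0,\theta_n]}(\cdot)}{F_0(\theta_n)} \tag{12}$$

and

$$\theta_n = \inf\left\{x,\ 1 - F_0(x) < \frac{\epsilon_n}{2n}\right\}.$$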

Lemma 8. Let $\Pi$ be either a Type 1 or Type 2 prior on $\mathcal{F}_L$ satisfying (2) and
let $S_n(\epsilon_n,\theta_n)$ be a set as in (11), then



ll(S n (e n , n ) > exp I de; 1 (l + | log(e»/n)|) log ( \ g(e n /n)\) ) [ (13) 
This lemma is proved in Appendix A. The $\epsilon$ metric entropy of the set of
bounded monotone non increasing densities has been shown to be less than $\epsilon^{-1}$,
up to a constant (see Groeneboom (1986) or van der Vaart and Wellner (1996)
for instance). As the prior puts mass on $\mathcal{F}_L$, on which $f(0)$ is not uniformly
bounded, we consider an increasing sequence of sieves

$$\mathcal{F}_n = \{f \in \mathcal{F}_L,\ f(0) \le M_n\} \tag{14}$$

where $M_n = \exp\left\{c\,n^{1/3}\log(n)^{2/3}(t+1)^{-1}\right\}$ with $t$ as in Theorem 1. The following
Lemma shows that $\mathcal{F}_n$ covers most of the support of $\Pi$ as $n$ increases.

Lemma 9. Let $\mathcal{F}_n$ be defined by (14) and $\Pi$ be either a Type 1 or Type 2 prior
satisfying (2). Then

$$\Pi(\mathcal{F}_n^c) \le e^{-c\,n^{1/3}\log(n)^{2/3}}.$$


Here again, the proof is postponed to Appendix A. We now get an upper
bound for the $\epsilon$-metric entropy of the set $\mathcal{F}_n$. Recall that in Groeneboom (1985)
it is proved that the $L^1$ metric entropy of monotone non increasing densities
on $[0,1]$ bounded by $M$ can be bounded from above by $C_0\log(M)\epsilon^{-1}$. We
thus cannot use the bound $M_n$ in the definition of the set $\mathcal{F}_n$ in (14), as it
would give a suboptimal control of the entropy needed to construct tests in a similar way as in
Ghosal et al. (2000). However, using a modified version of their results presented
in Rivoirard and Rousseau (2009), we only have to bound the $\epsilon$-metric entropy
of the sets

$$\mathcal{F}_{n,j} = \{f \in \mathcal{F}_n,\ j\epsilon_n \le d(f,f_0) \le (j+1)\epsilon_n\}$$

for $j \in \mathbb{N}^*$. We can easily adapt the results of Groeneboom (1985) to densities
on any interval $[a,b]$ and get the following Lemma.

Lemma 10. Let $\mathcal{F}$ be the set of monotone non increasing functions on $[a,b]$
such that for all $f$ in $\mathcal{F}$, $\int_a^b f \le M_2$ and $f \le M_1$, then

$$\log N(\epsilon,\mathcal{F},d) \le \epsilon^{-1}\log(M_1+1)\left((b-a) + 3M_2\right).$$



The proof of this Lemma is straightforward given the results of Groeneboom
(1985) and is thus omitted. Let $x_{n,j} \in [0,L]$ be such that $\epsilon_n/2 \le x_{n,j} \le \epsilon_n$. We
denote for all $f$ in $\mathcal{F}_{n,j}$, $f_{1,j} = f\,\mathbb{I}_{[0,x_{n,j})}$ and $f_{2,j} = f\,\mathbb{I}_{[x_{n,j},L]}$. Since for all $f$ in
$\mathcal{F}_{n,j}$ we have $\int_0^L |f(x) - f_0(x)|\,dx \le (j+1)\epsilon_n$, we get

$$\int_0^{x_{n,j}} f(x)\,dx - \int_0^{x_{n,j}} f_0(x)\,dx \le (j+1)\epsilon_n$$

which implies that

$$x_{n,j}\,f(x_{n,j}) \le x_{n,j}\,f_0(0) + (j+1)\epsilon_n$$

which in turn gives

$$f(x_{n,j}) \le f_0(0) + 2(j+1).$$
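
The passage from the integral inequality to the pointwise bound only uses the monotonicity of $f$ and $f_0$ together with $x_{n,j} \ge \epsilon_n/2$. Spelled out as a short sketch in the notation of this section:

```latex
\begin{align*}
x_{n,j}\, f(x_{n,j})
  &\le \int_0^{x_{n,j}} f(x)\,dx
     && \text{($f$ is non increasing)} \\
  &\le \int_0^{x_{n,j}} f_0(x)\,dx + (j+1)\epsilon_n
     && \text{($L^1$ distance at most $(j+1)\epsilon_n$)} \\
  &\le x_{n,j}\, f_0(0) + (j+1)\epsilon_n
     && \text{($f_0 \le f_0(0)$)},
\end{align*}
and dividing by $x_{n,j} \ge \epsilon_n/2$ gives
\[
f(x_{n,j}) \le f_0(0) + \frac{(j+1)\epsilon_n}{x_{n,j}} \le f_0(0) + 2(j+1).
\]
```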

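Recall that for all $f \in \mathcal{F}_n$ we have $f(0) \le M_n$. Using Lemma 10, we construct
an $\epsilon_n/2$-net for the set $\mathcal{F}^1_{n,j} = \{f_{1,j},\ f \in \mathcal{F}_{n,j}\}$ with $N_1$ points, and

$$\log(N_1) \le \epsilon_n^{-1}\log(M_n+1)\,\epsilon_n\,(j+2)$$

and thus deduce

$$\log(N_1) \le C' n\epsilon_n^2 j^2. \tag{15}$$
Similarly, given that $f(x_{n,j}) \le M + 2(j+1)$, we get an $\epsilon_n/2$-net for the set
$\mathcal{F}^2_{n,j} = \{f_{2,j},\ f \in \mathcal{F}_{n,j}\}$ with $N_2$ points and

$$\log(N_2) \le C' n\epsilon_n^2 j^2. \tag{16}$$

This provides an $\epsilon_n$-net for $\mathcal{F}_{n,j}$ with less than $N_1 \times N_2$ points. Given (15) and
(16), the $L^1$ metric entropy of the sets $\mathcal{F}_{n,j}$ satisfies

$$\log N(\epsilon_n,\mathcal{F}_{n,j},L^1) \le C n\epsilon_n^2 j^2. \tag{17}$$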

Here we only control the entropy of slices of the sieves, and thus cannot apply
directly the results of Ghosal et al. (2000). A modified version of this theorem
proposed in Rivoirard and Rousseau (2009) (Theorem 4.1) can be used in that
case. Nonetheless, this result requires the Kullback-Leibler property, which is not
satisfied here. Given Theorem 2.1 of Khazaei et al. (2010), we get that the same
result holds in the convergence rate setting by replacing $\epsilon$ by $\epsilon^2$ in their proof.
We thus derive a modified version of Theorem 4.1 of Rivoirard and Rousseau
(2009), given in Appendix B, which is valid without the Kullback-Leibler property.
Thus Theorem 1 is proved.

Extension to $\mathbb{R}^+$. Given that $f_0(x) \le e^{-\beta x^{\tau}}$ when $x$ goes to infinity, if $\theta_n$ is
such that $\theta_n = \inf\{x,\ 1 - F_0(x) < \epsilon_n/(2n)\}$ then $\theta_n \lesssim (\log(n))^{1/\tau}$. As
before, we can approximate the restriction of $f$ to $[0,\theta_n]$ by a piecewise constant
function $g$ as in Lemma 7. Setting $g(x) = 0$ if $x > \theta_n$ we get an approximation of
$f$ and Lemma 7 still holds. Similarly, Lemma 8 still holds under the exponential
tail assumption. We now get an upper bound for the $\epsilon$-metric entropy of $\mathcal{F}_{n,j}$.
As before we split $\mathcal{F}_{n,j}$ into two parts. The construction of an $\epsilon_n/2$-net
for $\mathcal{F}^1_{n,j}$ does not change and therefore (15) holds. Finally, let
$\tilde{\mathcal{F}}^2_{n,j} = \{f \in \mathcal{F}^2_{n,j},\ \forall x > \theta_n,\ f(x) = 0\}$. Given Lemma 10, we get for $c_1 > 0$ large enough an
$\epsilon_n/(2c_1(j+1))$-net for $\tilde{\mathcal{F}}^2_{n,j}$ by considering $f^*$, the restriction of $f$ to $[x_{n,j},\theta_n]$.
We have

$$d(f,f^*) \le c_2 (j+1)\epsilon_n$$

where $d(\cdot,\cdot)$ is either the $L^1$ or Hellinger distance. Hence, for $c_1 > c_2$ we obtain an $\epsilon_n/2$-net
for $\mathcal{F}^2_{n,j}$ with at most $e^{c_3 n\epsilon_n^2}$ points and thus

$$\log N(\epsilon_n,\mathcal{F}_{n,j},d) \le C'' n\epsilon_n^2 j^2.$$
We conclude using the same arguments as in the preceding section, and thus
Theorem 2 is proved.

3.2. Proof of Theorems 3 and 4 

Define $f(0)$ as $\lim_{x\to 0} f(x) = f(0^+)$. Since $f$ is monotone non increasing, this
limit exists and is uniquely defined. It is well known that the maximum likelihood
estimator is not a consistent estimator of $f_0(0^+)$ (see Balabdaoui et al. (2011)
for instance). In this section, we prove that the Bayesian approaches we consider
do not suffer from this and that the posterior distribution of the density at
any fixed point of the support of $f_0$ is consistent. Moreover, we prove that the
Bayesian approach yields a consistent estimator, namely the posterior median.
For the sake of the presentation, we only consider the case $L = 1$. The same
results hold for any $L$ such that $0 < L < \infty$.

We first prove consistency of the posterior distribution for $f(x)$ for $x \in [0,1]$
and thus prove the first part of Theorem 3.

We must therefore prove that if $A_\epsilon = \{P,\ |f_P(x) - f_0(x)| > \epsilon\}$ then

$$\Pi(A_\epsilon \mid X^n) = o_{P_0}(1).$$

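Note that Theorem 1 implies that

$$\Pi\left[|f_P - f_0|_1 \le \epsilon_n \mid X^n\right] = 1 + o_{P_0}(1)$$

so that we can substitute $A_\epsilon$ with $A_\epsilon \cap \{|f_P - f_0|_1 \le \epsilon_n\}$, which we still denote
$A_\epsilon$. Let $c_n = 5(\log(n))^{1/3}/\epsilon$. We have

$$\Pi\left(|f_P(x) - f_0(x)| > \epsilon \mid X^n\right) = \frac{N_n}{D_n} \tag{18}$$

where

$$N_n = \int \mathbb{I}_{A_\epsilon} \prod_{i=1}^n \frac{f_P}{f_0}(X_i)\,d\Pi(P) \quad\text{and}\quad D_n = \int \mathbb{I}_{|f_P - f_0|_1 \le \epsilon_n} \prod_{i=1}^n \frac{f_P}{f_0}(X_i)\,d\Pi(P).$$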

Let $\hat f_n$ be the maximum likelihood estimator of $f_0$. Let $\tilde f_n$ be
such that

$$\tilde f_n(x) = \begin{cases} \hat f_n(1 - c_n n^{-1/3}) & \text{if } x = 1 \\ \hat f_n(c_n n^{-1/3}) & \text{if } x = 0 \\ \hat f_n(x) & \text{if } 0 < x < 1. \end{cases}$$

We define the test functions $\phi_n^x = \mathbb{I}_{|\tilde f_n(x) - f_0(x)| > c_n n^{-1/3}}$. It is known from
Prakasa Rao (1970) and Kulikov and Lopuhaä (2006) that for a fixed $x$, $\tilde f_n(x)$
is consistent so that

$$E_0^n(\phi_n^x) = o(1). \tag{19}$$
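
The maximum likelihood estimator used in these tests is the Grenander estimator, that is, the left derivative of the least concave majorant of the empirical distribution function. A minimal Python sketch computing it by pooling adjacent violators is given below; the function names and the (knots, heights) output format are choices made for this illustration, and the observations are assumed to be strictly positive.

```python
def grenander(sample):
    """Grenander estimator of a non increasing density on [0, infinity).

    Returns (knots, heights): the estimate equals heights[i] on
    (knots[i-1], knots[i]], with the convention knots[-1] = 0 for i = 0,
    and is 0 beyond the largest observation.
    """
    xs = sorted(sample)
    n = len(xs)
    blocks = []  # [width, mass] of each linear piece of the concave majorant
    prev = 0.0
    for x in xs:
        blocks.append([x - prev, 1.0 / n])
        prev = x
        # Slopes mass / width must be non increasing; compare by cross-multiplying
        # and pool the last two pieces while the order is violated.
        while (len(blocks) > 1
               and blocks[-1][1] * blocks[-2][0] > blocks[-2][1] * blocks[-1][0]):
            width, mass = blocks.pop()
            blocks[-1][0] += width
            blocks[-1][1] += mass
    knots, heights, right = [], [], 0.0
    for width, mass in blocks:
        right += width
        knots.append(right)
        heights.append(mass / width)
    return knots, heights


def evaluate(t, knots, heights):
    """Value of the step estimate at t >= 0."""
    for knot, height in zip(knots, heights):
        if t <= knot:
            return height
    return 0.0
```

For instance, `grenander([0.5, 0.6])` pools the two observations into a single piece of height $1/0.6$ on $(0, 0.6]$, since the raw slopes $1$ and $5$ are not in decreasing order.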
We now obtain an exponentially small bound for the type II error of the test. By the triangle inequality,

$$|\tilde f_n(x) - f_0(x)| \ge |f(x) - f_0(x)| - |\tilde f_n(x) - f(x)|.$$

Following Kulikov and Lopuhaä (2006) we consider the inverse process $U_n$
defined as $U_n(a) = \mathrm{argmax}_{t \ge 0}(F_n(t) - at)$, where $F_n$ is the empirical cumulative
distribution function, with $a = \epsilon/2 + f(x)$. Recall that $|f_P(x) - f_0(x)| > \epsilon$ on $A_\epsilon$. We denote

$$x_n = \begin{cases} c_n n^{-1/3} & \text{if } x = 0 \\ 1 - c_n n^{-1/3} & \text{if } x = 1 \\ x & \text{otherwise.} \end{cases}$$

We thus have for $n$ large enough

$$P_f\left(|\tilde f_n(x) - f_0(x)| \le c_n n^{-1/3}\right) \le P_f\left(|\tilde f_n(x) - f(x)| \ge \frac{\epsilon}{2}\right) \tag{20}$$
which leads to

$$P_f\left(|\tilde f_n(x) - f_0(x)| \le c_n n^{-1/3}\right) \le P_f\left(U_n(\epsilon/2 + f(x)) \ge x_n\right).$$
We now recover the direct process and obtain

$$P_f\left(|\tilde f_n(x) - f_0(x)| \le c_n n^{-1/3}\right) \le P_f\left(\sup_{t \ge x_n}(F_n(t) - at) \ge \sup_{t < x_n}(F_n(t) - at)\right).$$

Note that, for all $t_n < x_n$ and $t \ge x_n$,

$$F(t) - at - (F(t_n) - at_n) = \int_0^{t}\left(f(u) - f(x) - \frac{\epsilon}{2}\right)du - \int_0^{t_n}\left(f(u) - f(x) - \frac{\epsilon}{2}\right)du \le -\frac{\epsilon}{2}(t - t_n).$$
Since $f$ is monotone non increasing, we thus deduce that, for $t_n < x_n$,

$$\begin{aligned}
P_f\left(\sup_{t \ge x_n}(F_n(t) - at) \ge \sup_{t < x_n}(F_n(t) - at)\right)
&\le P_f\left(\sup_{t \ge x_n}(F_n(t) - at) \ge F_n(t_n) - at_n\right) \\
&\le P_f\left(\sup_{t \ge x_n}\left\{(F_n(t) - F(t) - at) - (F_n(t_n) - F(t_n) - at_n) + F(t) - F(t_n)\right\} \ge 0\right) \\
&\le P_f\left(\sup_{t \ge 0}|F_n(t) - F(t)| \ge \frac{\epsilon}{4}(x_n - t_n)\right)
\end{aligned}$$

and, taking $t_n = x_n/2$, we get

$$P_f\left(\sup_{t \ge x_n}(F_n(t) - at) \ge \sup_{t < x_n}(F_n(t) - at)\right) \le P_f\left(\sup_{t \ge 0}\sqrt{n}\,|F_n(t) - F(t)| \ge \frac{\epsilon}{8}\sqrt{n}\,x_n\right).$$

We now have to study the empirical process $\sqrt{n}(F_n(t) - F(t))$ to control $N_n$.
These processes have been widely studied; van der Vaart and Wellner (1996)
obtained some large deviation inequalities. In order to use the results of van der Vaart and Wellner
(1996), we have to control the metric entropy of the set $\mathcal{G} = \{\mathbb{I}_{[0,t]},\ t \in [0,1]\}$.
The following Lemma gives an upper bound for $N_{[\,]}(\epsilon,\mathcal{G},L_2)$. The proof
is postponed to Appendix A.

Lemma 11. Let $\mathcal{G} = \{\mathbb{I}_{[0,t]};\ t \in [0,1]\}$, then

$$N_{[\,]}(\epsilon,\mathcal{G},L_2) \le C\left(\frac{1}{\epsilon}\right)^{1/2}.$$


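Now using Theorem 2.14.9 of van der Vaart and Wellner (1996), we have

$$P_f\left[|\tilde f_n(x) - f_0(x)| \le c_n n^{-1/3}\right] \le C\left(\sqrt{n}\,x_n\,\epsilon\right)^{1/2} e^{-n x_n^2\epsilon^2/32}$$

and thus have

$$P_f\left(|\tilde f_n(x) - f_0(x)| \le c_n n^{-1/3}\right) = E_f(1 - \phi_n^x) \le C e^{-n x_n^2\epsilon^2/33}. \tag{21}$$
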



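Similarly to the proof of Theorem 1, following Khazaei et al. (2010), we get
an exponentially small lower bound for $D_n$. More precisely, we get that

$$D_n \ge 2e^{-(c+2)n\epsilon_n^2}$$
with probability that goes to 1. Note that

$$E_0^n\left(\frac{N_n}{D_n}\right) \le E_0^n(\phi_n^x) + P_0^n\left(D_n < e^{-(c+2)n\epsilon_n^2}\right) + E_0^n\left(\Pi[\mathcal{F}_n^c \mid X^n]\right) + e^{(c+2)n\epsilon_n^2}\int_{A_\epsilon \cap \mathcal{F}_n} E_f(1 - \phi_n^x)\,d\Pi(f). \tag{22}$$
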



Given the preceding results, we have

$$e^{(c+2)n\epsilon_n^2}\int_{A_\epsilon \cap \mathcal{F}_n} E_f(1 - \phi_n^x)\,d\Pi(f) \le e^{(c+2)n\epsilon_n^2}\sup_{f \in A_\epsilon \cap \mathcal{F}_n} E_f(1 - \phi_n^x).$$

And with equation (21), given that for $n$ large enough $x_n \ge c_n n^{-1/3}$, we get
that

$$E_0^n\left(\Pi(A_\epsilon \mid X^n)\right) = o(1)$$

which ends the proof of the first part of Theorem 3. Note that for $x$ in the interior
of the support of $f_0$ the same proof holds if we replace $\epsilon$ by $\epsilon_n = (n/\log(n))^{-1/3}$.
We thus get the proof of Theorem 4.

Consistency of a Bayesian estimator. We consider in this section $\hat f_n^\pi(t)$,
the Bayesian estimator associated with the absolute error loss, defined as the
median of the posterior distribution. Consistency of the posterior mean, which
is the most common Bayesian estimator, is not proved here but could
nevertheless be an interesting result.

We first define $\hat f_n^\pi(t)$ such that

$$\hat f_n^\pi(t) = \inf\{x,\ \Pi[f_P(t) \le x \mid X^n] > 1/2\}. \tag{23}$$

In order to get consistency in probability we note that if $\hat f_n^\pi(t) - f_0(t) > \epsilon$
then

$$\Pi\left(f_P(t) > f_0(t) + \epsilon \mid X^n\right) \ge 1/2.$$
And if $\hat f_n^\pi(t) - f_0(t) < -\epsilon$ then

$$\Pi\left(f_P(t) < f_0(t) - \epsilon \mid X^n\right) \ge 1/2.$$
We deduce, with the Markov inequality and Theorem 3,

$$\begin{aligned}
P_0^n\left(\hat f_n^\pi(t) - f_0(t) > \epsilon\right) &\le P_0^n\left(\Pi(f_P(t) > f_0(t) + \epsilon \mid X^n) \ge 1/2\right) \\
&\le 2E_0^n\left(\Pi(f_P(t) > f_0(t) + \epsilon \mid X^n)\right) \\
&\le o(1)
\end{aligned}$$

and similarly

$$P_0^n\left(\hat f_n^\pi(t) - f_0(t) < -\epsilon\right) \le o(1).$$

Thus we have $P_0\left(|\hat f_n^\pi(t) - f_0(t)| > \epsilon\right) \to 0$, which gives the consistency in
probability of $\hat f_n^\pi(t)$.
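
In practice the posterior median in (23) would be approximated from posterior draws of $f_P(t)$. The Python sketch below assumes such draws are available from some sampler, which is not described in the paper, and applies definition (23) to their empirical distribution; the function name `posterior_median` is hypothetical.

```python
def posterior_median(draws):
    """Smallest x such that the empirical posterior CDF of the draws at x exceeds 1/2."""
    ordered = sorted(draws)
    # With N equally weighted draws, the CDF at the k-th order statistic (0-based)
    # is at least (k + 1) / N, which first exceeds 1/2 at k = N // 2.
    return ordered[len(ordered) // 2]


# Hypothetical posterior draws of f_P(t) at a fixed point t.
draws = [0.9, 1.1, 1.0, 1.3]
estimate = posterior_median(draws)
```

With an even number of draws this convention returns the upper of the two middle values, in line with the strict inequality in (23).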



3.3. Proof of Theorems 5 and 6 



In this section we prove that the posterior distribution is consistent in sup 
norm. We prove that, if the posterior distribution is consistent at the points of 
a sufficiently refined partition of [0,L] then it is consistent for the sup norm. 
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Here again, we will only consider the case L — 1 without loss of generality. We 
first denote 

B e =\f, sup{|/(;r)-/o(aO| >el 

{ [0,L] J 

Let C' be a positive constant such that ||/ol|oo < C' Q and let (xi)i be the 
separation points of a e/(8Co) regular partition of [0, 1] and p = Card{(:Ei)i}. 
Note that 

v 

B e =\J{f, sup {|/(ar) - / (a:)| > e}. 

i = l [Xi,X i+1 ] 

Recall that A% = {/, \f{x) - f (x)\ > e}. We consider the set B e ff l=1 (A x e j 8 ) c . 
Note that given Theorem 3, we have that 

E^n^Q(^; 5 )|x"n = (i) 

For / S B e we have for all x € [xi, a^j+i], 

\I(X) - f (x)\ < \f(x) - f( Xi )\ + \f( Xi ) - f (Xi)\ + \f (xi) - f (x)\. 

Given that / is monotone non increasing, and given the hypotheses on /o we 
have 

\f(x)-f(x l )\<\f(x l+1 )-f(x l )\ 

< \f(xi+i) - /o(asi+i)i + \fo(xi+i) - fo(xi)\ + \fo(xi) - f(xi)\ 

< 3e/5 

and for the same reasons 

$$ |f(x_i) - f_0(x_i)| + |f_0(x_i) - f_0(x)| \le 2\varepsilon/5, $$

which leads to 

$$ |f(x) - f_0(x)| \le \varepsilon $$

and thus, taking the supremum over $x$, we get 

$$ \sup_{[0,1]} |f(x) - f_0(x)| \le \varepsilon. $$

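We then deduce 

$$ \Pi(B_\varepsilon \mid X^n) \le \Pi\Big(B_\varepsilon \cap \Big\{\bigcap_{i=1}^{p} (A_{\varepsilon/5}^{x_i})^c\Big\} \,\Big|\, X^n\Big) + \Pi\Big(\bigcup_{i=1}^{p} A_{\varepsilon/5}^{x_i} \,\Big|\, X^n\Big) = o_P(1), $$

which gives the consistency of the posterior distribution in sup norm. 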
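The deterministic step of this argument can be checked numerically. The sketch below is an illustration only: the target $f_0(x) = 2(1 - x)$, the value $\varepsilon = 0.5$ and the step function `f` are hypothetical choices, not taken from the paper. It checks that a monotone non increasing function within $\varepsilon/5$ of $f_0$ at the points of a regular partition with mesh $\varepsilon/(8C)$, where $C$ bounds $|f_0'|$, stays within $\varepsilon$ of $f_0$ in sup norm.

```python
import numpy as np

eps = 0.5
C0 = 2.0                                  # assumed bound on |f0'|
f0 = lambda x: 2.0 * (1.0 - x)            # hypothetical true density on [0, 1]

# Regular partition of [0, 1] with mesh eps / (8 * C0)
n_bins = int(round(8 * C0 / eps))
grid = np.arange(n_bins + 1) / n_bins

def f(x):
    """Monotone non increasing step function, 0.08 above f0 at each grid point."""
    idx = np.clip(np.searchsorted(grid, x, side="right") - 1, 0, n_bins)
    return f0(grid[idx]) + 0.08

# Largest gap at the partition points, then on a much finer mesh
grid_gap = np.max(np.abs(f(grid) - f0(grid)))
fine = np.linspace(0.0, 1.0, 20001)
sup_gap = np.max(np.abs(f(fine) - f0(fine)))

print(grid_gap <= eps / 5, sup_gap <= eps)
```

The gap between grid points is larger than the gap at the grid points, but monotonicity of `f` and the Lipschitz bound on `f0` keep it below `eps`.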



We now prove that we get the optimal concentration rate when considering the 
sup norm over any fixed sub-interval of $[0, 1]$. To do so, we use the same approach 
as before, replacing $\varepsilon$ by $\varepsilon_n = (n/\log(n))^{-1/3}$. We thus have an increasing number 
of bins $p = p_n$ in the partition $(x_i)_i$. Note that $p_n \asymp \varepsilon_n^{-1}$. 

We first control the rate of convergence of the type I error of a single test 
defined as before. Prakasa Rao (1970) found the asymptotic behaviour of the 
nonparametric maximum likelihood estimator at a fixed point $t \in (0, L)$ and 
proved that 

$$ n^{1/3} |\hat f_n(t) - f_0(t)| \xrightarrow{\ \mathcal{L}\ } 2CZ \qquad (24) $$

with $C$ a positive constant and $Z = \mathrm{argmax}\{W(u) - u^2,\ u \in \mathbb{R}\}$ with $W$ a 
standard Wiener process on $\mathbb{R}$ originating from 0. Groeneboom (1989) studies 
the process $Z$ very precisely. Given point (iii) of Corollary 3.4 of Groeneboom 
(1989) and equation (24), we get that for $t \in (0, L)$ 

P?(\fn(t)-f(t)\>c n n-V 3 )<o(e- 1 ). 
We thus deduce that 

]TE£(C)<o(l). 
Similarly to (18), we decompose U(A* |X n ) such that 

mA . , xn) = Jknf(^)rfn(P) = n» 
fi lf - M < en % (Xi)da(p) D n 

The proof of Theorem 2.1 of Khazaei et al. (2010) gives us 

n(£>„ > 2e- (c+2)ne ") < 
and similarly to (22) we get for all x 6 (0, 1) 

E^(n[J^|X n ]) + e ( c + 2 )"^ / E'/(l - ^)dU(f) 
The two last terms of (25) are exponentially small. We thus deduce that 

U(\J(A: :/s )\X n )<^E n (^)+ Pn -^+o(l)<o(l) 

i=l i=l n£ ™ 

Similarly to before, we prove Theorem 6. 



(25) 
4. Discussion 

In this paper, we obtain an upper bound for the concentration rate of the posterior 
distribution under monotonicity constraints. This is of interest since, in this 
model, the standard approach based on the seminal paper of Ghosal et al. (2000) 
cannot be applied, given that the Kullback-Leibler property is not satisfied. We 
prove that the posterior concentrates at the rate $(n/\log(n))^{-1/3}$, which is the 
minimax estimation rate up to a $\log(n)$ factor, for standard losses such as $L_1$ or 
Hellinger. 

More interestingly, we prove that the posterior distribution is consistent for 
the pointwise loss at any point of the support and for the sup norm loss. We 
also derive almost optimal concentration rates of the posterior distribution for 
these losses. Giné and Nickl (2012) derived results on posterior concentration 
for the sup norm. They prove that the posterior concentration rate in this case 
is suboptimal under Hölder constraints on the true density. Note that, similarly 
to Giné and Nickl (2012), our results on the sup norm are proved using tests 
constructed from a frequentist estimator $\hat f_n$ of the true density. The pointwise loss 
corresponds to estimating an unsmooth linear functional, see Arbel et al. (2012) 
or Li and Zhao (2002). The recent literature on concentration rates in semiparametric 
models, in particular for functionals of curves, shows that this is 
a difficult problem, see Rivoirard and Rousseau (2009) or Kleijn and Knapik 
(2012). Interestingly, in the context of shape constrained densities, it turns out 
to be not so difficult. 
Appendix A: Technical Lemmas 
A.1. Proof of Lemma 7 

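Proof. For a fixed $\varepsilon$, let $f$ be in $\mathcal{F}_L$. Consider $\mathcal{P}_0$ the coarsest partition: 

$$ 0 = x_0^0 < x_1^0 = L, $$

at the $i$-th step, let $\mathcal{P}_i$ be the partition 

$$ 0 = x_0^i < x_1^i < \dots < x_{n_i}^i = L, $$

and define 

$$ \varepsilon_i = \max_j \big\{ (f(x_{j-1}^i) - f(x_j^i)) (x_j^i - x_{j-1}^i)^{1/2} \big\}. $$

For each $j \ge 1$, if $(f(x_{j-1}^i) - f(x_j^i))(x_j^i - x_{j-1}^i)^{1/2} > \varepsilon_i/\sqrt{2}$, we split the interval 
$[x_{j-1}^i, x_j^i]$ into two subsets of equal length. We then get a new partition $\mathcal{P}_{i+1}$. 
We continue the partitioning until the first $k$ such that $\varepsilon_k \le \varepsilon^{3/2}$. At each step $i$, 
let $n_i$ be the number of intervals in $\mathcal{P}_i$, $s_i$ the number of intervals in $\mathcal{P}_i$ that have 
been divided to obtain $\mathcal{P}_{i+1}$, and $c = 1/\sqrt{2}$. Thus, it is clear that $\varepsilon_{i+1} \le c\,\varepsilon_i$ and 

$$ s_i (c\,\varepsilon_i)^{2/3} \le \sum_j (f(x_{j-1}^i) - f(x_j^i))^{2/3} (x_j^i - x_{j-1}^i)^{1/3} \le M^{2/3} L^{1/3} $$

using Hölder inequality. We then deduce that 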
j=i j=i j=i i=i 

< 2M 2 ^L 1 / 3 e- 2/3 2 1 / 3 J2j2-^ 3 

3 = 1 

where $K_0 = 2(1 - 2^{-2/3})^{-2}$. Thus 

$$ n_k \le K_0 M^{2/3} L^{1/3} \varepsilon^{-1}. \qquad (26) $$
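The splitting procedure can be written in a few lines. The following sketch is an illustration under assumptions: the function name `dyadic_partition`, the example $f(x) = 2(1 - x)$ on $[0, 1]$ (so $M = 2$, $L = 1$) and the value $\varepsilon = 0.1$ are hypothetical, the splitting threshold is taken to be $\varepsilon_i/\sqrt{2}$ (consistent with $c = 1/\sqrt{2}$), and the constant `K0` is the value $2(1 - 2^{-2/3})^{-2}$ as read from the text. It stops as soon as $\varepsilon_k \le \varepsilon^{3/2}$ and compares the number of intervals with the bound (26).

```python
import math

def dyadic_partition(f, L, eps):
    """Halve intervals of [0, L] until eps_k <= eps**1.5; return (breakpoints, k)."""
    pts, k = [0.0, L], 0
    while True:
        pairs = list(zip(pts, pts[1:]))
        # Score of an interval: (f(left) - f(right)) * sqrt(length)
        scores = [(f(a) - f(b)) * math.sqrt(b - a) for a, b in pairs]
        eps_i = max(scores)
        if eps_i <= eps ** 1.5:
            return pts, k
        new_pts = [pts[0]]
        for (a, b), s in zip(pairs, scores):
            if s > eps_i / math.sqrt(2.0):      # assumed splitting threshold
                new_pts.append((a + b) / 2.0)   # split into two equal halves
            new_pts.append(b)
        pts, k = new_pts, k + 1

M, L, eps = 2.0, 1.0, 0.1
f = lambda x: M * (1.0 - x / L)                 # hypothetical monotone density
pts, k = dyadic_partition(f, L, eps)
n_k = len(pts) - 1
K0 = 2.0 * (1.0 - 2.0 ** (-2.0 / 3.0)) ** (-2)
bound = K0 * M ** (2.0 / 3.0) * L ** (1.0 / 3.0) / eps
print(n_k, k, n_k <= bound)
```

For this linear example every interval has the same score, so each step halves all intervals and the procedure ends with a regular partition.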

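Now, for $f \in \mathcal{F}_L$, we prove that there exists a stepwise density $h$ with less than 
$K_0 M^{2/3} L^{1/3} \varepsilon^{-1}$ pieces such that 

$$ KL(f, h) \lesssim \varepsilon^2 \quad \text{and} \quad \int f \log\Big(\frac{f}{h}\Big)^2(x)\,dx \lesssim \varepsilon^2 \qquad (27) $$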
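In order to simplify notations, we define 

$$ x_i = x_i^k, \quad l_i = x_i - x_{i-1}, \quad g_i = f^{1/2}(x_{i-1}). $$

We consider the partition constructed above associated with $f^{1/2}$, which is 
also a monotone nonincreasing function that satisfies $f^{1/2}(0) \le M^{1/2}$ (instead of 
$M$). We denote $g$ the function defined as $g(x) = \sum_i \mathbb{1}_{[x_{i-1}, x_i]}(x)\, g_i$. Then 

$$ \begin{aligned}
\int (f^{1/2} - g)^2(x)\,dx &= \sum_{i=1}^{n_k} \int_{x_{i-1}}^{x_i} (f^{1/2} - g)^2(x)\,dx \\
&\le \sum_{i=1}^{n_k} \int_{x_{i-1}}^{x_i} (f^{1/2}(x_{i-1}) - f^{1/2}(x_i))^2\,dx \\
&\le n_k \varepsilon_k^2 \le L^{1/3} K_0 M^{1/3} \varepsilon^2.
\end{aligned} $$

We then define $h = g^2 / \int g^2$ and get an equivalent of $\int g^2$. 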



$$ \begin{aligned}
\int g^2\,dx &= \int (g^2 - f)(x)\,dx + 1 \\
&= \int (g - \sqrt{f})(g + \sqrt{f})(x)\,dx + 1 \\
&= 1 + O(\varepsilon)
\end{aligned} $$

and deduce that $(\int g^2)^{1/2} = 1 + O(\varepsilon)$. Let $H$ be the Hellinger distance 



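$$ \begin{aligned}
H^2(f, h) &= H^2\Big(f, \frac{g^2}{\int g^2}\Big) \\
&\lesssim H^2(f, g^2) + H^2\Big(g^2, \frac{g^2}{\int g^2}\Big) \\
&\lesssim L^{1/3} K_0 M^{1/3} \varepsilon^2 + \int \Big(g - \frac{g}{(\int g^2)^{1/2}}\Big)^2(x)\,dx \lesssim \varepsilon^2.
\end{aligned} $$

Since $\|f/h\|_\infty = \|f/g^2\|_\infty (\int g^2) \le (\int g^2)$, together with the above bound 
on $H(f, h)$ and Lemma 8 from Ghosal and van der Vaart (2007), we obtain the 
required result. 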

Let $P$ be a probability distribution defined by 

$$ P = \sum_{i=1}^{n_k} p_i \delta_{x_i^k}, \quad p_i = (h_{i-1} - h_i)\, x_i^k, \quad p_{n_k} = h_{n_k} x_{n_k}^k = h_{n_k} L, $$

thus $f_P = h$ and given the previous result, Lemma 7 is proved. □ 
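The identity $f_P = h$ can be verified numerically from the mixture representation (1), that is $f_P(x) = \int \theta^{-1} \mathbb{1}_{[0,\theta]}(x)\,dP(\theta)$. In the sketch below the breakpoints and heights are hypothetical, the function names are illustrative, and the indexing (height `h[i]` on the interval ending at `x[i]`, with a zero height appended after the last one) is an assumption that may differ from the indexing used in the proof.

```python
import numpy as np

def mixing_weights(x, h):
    """Weights p_i = (h_i - h_{i+1}) * x_i, with h_{m+1} = 0 by convention."""
    x, h = np.asarray(x, float), np.asarray(h, float)
    h_next = np.append(h[1:], 0.0)
    return (h - h_next) * x

def mixture_density(t, x, p):
    """f_P(t) = sum_i p_i * 1{t <= x_i} / x_i for a discrete mixing distribution."""
    t = np.atleast_1d(np.asarray(t, float))
    x, p = np.asarray(x, float), np.asarray(p, float)
    return ((t[:, None] <= x[None, :]) * (p / x)[None, :]).sum(axis=1)

# Hypothetical stepwise density: heights 2, 1.5, 0.25 on (0, .25], (.25, .5], (.5, 1]
x = np.array([0.25, 0.5, 1.0])
h = np.array([2.0, 1.5, 0.25])
p = mixing_weights(x, h)
print(p, p.sum(), mixture_density([0.1, 0.3, 0.75], x, p))
```

The weights sum to one exactly when the step function integrates to one, so a stepwise non increasing density always corresponds to a finite mixing distribution supported on its breakpoints.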



A.2. Proof of Lemma 8 



Proof. For $\varepsilon_n$ as in Theorem 1, define $\theta_n$ as 

n = inLfz, 1 - F (x) > ^} 
2n 

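Using Lemma 7 with $L = \theta_n$, we obtain that there exists a distribution $P = 
\sum_{i=1}^{n_k} \delta_{x_i} p_i$ such that 

$$ KL(f_{0,n}, f_P) \lesssim \varepsilon_n^2, \quad \text{and} \quad \int f_{0,n} \log\Big(\frac{f_{0,n}}{f_P}\Big)^2 \lesssim \varepsilon_n^2 $$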

Note that $f_P$ has support $[0, \theta_n]$ and is such that $f_P(\theta_n) > 0$. Now, set $m = n_k$ 
and consider $P'$ the mixing distribution associated with $\{m, x_1', \dots, x_m', p_1', \dots, p_m'\}$ 
with $\sum_{i=1}^m p_i' = 1$. Define for $1 \le i \le m-1$ the set $U_i = [0 \vee (x_i - \varepsilon_n^3/M),\ x_i + \varepsilon_n^3/M]$ 
and $U_m = (\theta_n,\ \theta_n + \varepsilon_n(L - \theta_n) \wedge \varepsilon_n^3/M]$. Construct $P'$ such that $x_i' \in U_i$ and 
$|P'(U_i) - p_i| \le \varepsilon_n^2 m^{-1}$. We get 

Vi e [OA] /M*) > 

Given that x' m £ U m , we get x' m < 9 n + e n (L — 0„) A e^/M < 0„ for n large 
enough. Note also that p' m > p m — e^mT 1 . Given the construction of Lemma 7, 
we deduce 

. fa(Xi-l) > , / v 
for n large enough. Furthermore, 

Vz < 6 n , f (z)(L - z) > J f (t)dt > |^ 

thus 

n n 

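and deduce that $\|f_0/f_{P'}\|_\infty \lesssim n/\varepsilon_n$. Lemma 8 from Ghosal and van der Vaart 
(2007) gives us that 

$$ \begin{aligned}
\int_0^{\theta_n} f_0(x) \log\Big(\frac{f_0}{f_{P'}}\Big)(x)\,dx &\lesssim (\varepsilon_n^2 + H^2(f_P, f_{P'}))(1 + |\log(\varepsilon_n/n)|) \\
&\le (\varepsilon_n^2 + |f_P - f_{P'}|_1)(1 + |\log(\varepsilon_n/n)|)
\end{aligned} $$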

Given the mixture representation (1) of /o and /p, we get 
(4 + |/p-/p'|i) (l + log(n)) 

~ ( e » + / " I E(- - 4)^ + E -fe'* - l *<*0 dx ) c 1 + lo sW) 
< (4 + E S - ^ + E ifi - »i + E -1^ - ^1) a + 1 lo s(")i) 

<4(l + |log(n)|) 

Generally speaking, denoting $U_0 = [0, 1] \cap (\cup_{i=1}^m U_i)^c$ and $\mathcal{N} = \{P',\ |P'(U_i) - p_i| \le 
\varepsilon_n^2 m^{-1}\}$, we obtain that for all $P' \in \mathcal{N}$ 



f (x) log (A) (^dx < 4(1 + I log(n)|) 

and similarly 



/o(x)log(A) (x)dx<e2(l + |log(n)|) 2 



VP 

for e n small enough. Note also that for all P' € Af and n large enough, similarly 
to before we get 

„T. 



/ fp<{x)dx 



< 



3/2 

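We now derive a control on $k$, the number of steps until $\varepsilon_k \le \varepsilon_n^{3/2}$ in the 
construction of Lemma 7. At step $k - 1$, we have $\varepsilon_{k-1} > \varepsilon_n^{3/2}$. It is clear that 
for all $j$, $\varepsilon_j \le 2^{-1/2} \varepsilon_{j-1}$, thus 

$$ M^{1/2} L^{1/2} 2^{-(k-1)/2} \ge \varepsilon_{k-1} > \varepsilon_n^{3/2} $$

$$ \log(M^{1/2} L^{1/2}) - (k-1)\frac{\log(2)}{2} > \frac{3}{2}\log(\varepsilon_n) $$

Finally, we have 

$$ k \le \frac{2}{\log(2)} \Big( \log(M^{1/2} L^{1/2}) - \frac{3}{2} \log(\varepsilon_n) \Big) + 1 \qquad (28) $$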
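Bound (28) follows from the geometric decay $\varepsilon_j \le 2^{-1/2}\varepsilon_{j-1}$ started at $\varepsilon_0 \le M^{1/2}L^{1/2}$. As a sanity check, the sketch below (function names and numerical values are hypothetical) computes the first index at which the slowest admissible geometric sequence drops below $\varepsilon_n^{3/2}$ and compares it with the right-hand side of (28).

```python
import math

def step_bound(M, L, eps_n):
    """Right-hand side of (28)."""
    return 2.0 / math.log(2.0) * (0.5 * math.log(M * L) - 1.5 * math.log(eps_n)) + 1.0

def worst_case_steps(M, L, eps_n):
    """First k with sqrt(M * L) * 2**(-k / 2) <= eps_n**1.5 (slowest allowed decay)."""
    k = 0
    while math.sqrt(M * L) * 2.0 ** (-k / 2.0) > eps_n ** 1.5:
        k += 1
    return k

for M, L, eps_n in [(2.0, 1.0, 0.1), (5.0, 3.0, 0.01), (1.0, 1.0, 0.3)]:
    print(worst_case_steps(M, L, eps_n), "<=", step_bound(M, L, eps_n))
```

Since any actual run of the construction decays at least this fast, its number of steps is also below the bound, which grows only like $\log(1/\varepsilon_n)$.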



We can then get a lower bound for $\Pi[\mathcal{N}]$ and, given that for $\varepsilon_n$ small enough 
and $n$ large enough, we have 

$$ \mathcal{N} \subset S_n(\varepsilon_n, \theta_n), $$

we can deduce a lower bound for $\Pi[S_n(\varepsilon_n, \theta_n)]$. For the Type 1 prior, we have 
n[W] = Pt(V(Aa(U ),...,Aa(U nk )) e [ Pt ±e 2 Jn k ]) 

r(A) 
n.rCAa^)) 



(Pi-e|/"fe)A0 



For $\varepsilon_n$ small enough, we have $A\alpha(U_j) \le 1$ for all $j$, and the integrand can be 
bounded from below by 1. Furthermore, $\Gamma(A\alpha(U_j)) = \Gamma(A\alpha(U_j) + 1)/(A\alpha(U_j)) \le 
1/(A\alpha(U_j))$. 
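The inequality $\Gamma(x) \le 1/x$ for $0 < x \le 1$ used here comes from the recurrence $\Gamma(x + 1) = x\,\Gamma(x)$ together with $\Gamma(x + 1) \le 1$ on that range. A quick numerical check, using only Python's standard `math.gamma` (the list of test values is an arbitrary choice):

```python
import math

xs = [1e-6, 0.01, 0.1, 0.5, 0.9, 1.0]

# Recurrence Gamma(x) = Gamma(x + 1) / x
recurrence_ok = all(
    math.isclose(math.gamma(x), math.gamma(x + 1.0) / x, rel_tol=1e-9) for x in xs
)
# Resulting bound Gamma(x) <= 1 / x on (0, 1]
bound_ok = all(math.gamma(x) <= 1.0 / x for x in xs)

print(recurrence_ok, bound_ok)
```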



U[N] > T(A)T\Aa(U l )(2^) 

> exp |log(r(A)) +Y^log(Aa(Ui)) + n fc (21og(e„) - log(n fc ) + log(2)) 

> exp j^log^a^)) + i(31og(e n ) +log(2))| 
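Given (2a), we get 

$$ \alpha(U_i) \asymp \int_{U_i} a_0 \theta^t\,d\theta $$

thus 

$$ \alpha(U_i) \asymp 2 \varepsilon_n^3 a_0 x_i^t. $$

It is obvious that for every $j$, $x_j \ge 2^{-k}$ thus 

$$ \begin{aligned}
\log(\alpha(U_i)) &\ge 3\log(\varepsilon_n) + \log(a_0) + t\log(x_i) \\
\sum_i \log(\alpha(U_i)) &\ge n_k(3\log(\varepsilon_n) + \log(a_0)) + t\sum_i \log(x_i) \\
&\ge n_k(3\log(\varepsilon_n) + \log(a_0) - tk\log(2))
\end{aligned} $$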




Using (28) 

Iog(o(Ci)) > n k (3 log(e„)(< + 1)+ 

i 

log(oo) + t(log(2) - 21og(M# 2 )) 

> Af 1 /30l/3 Koe -1^3 log(en)(< + 1) + 

log(oo) +ipog(2) ~ 21og(A/ 1 / 2 ^/ 3 )) 
^de^log^) 

for n large enough and e sufficiently small. For the Type 2 prior, we write 
M> = y = -pj\ ^ e9 /»k, K - *il < 4| c s n (e n ,e n ) 

we then deduce a lower bound for $\Pi[S_n(\varepsilon_n, \theta_n)]$ 

n k i-Pi+e 2 /n k "k 

n[AT] > Q(Jf - rut) n n^c"* / wfdwj n a(C^) 

j = l ./max(0,p;-e 2 /™fc) •_- L 

> exp j-crifc logn fc + log(a(f/ i )) + n fc log(c) - n k \og(n k ) + a 3 log(2e 2 /n fc )j 

> exp{Cje- 1 log(e)} 



A.3. Proof of Lemma 9 

Proof. Recall that given (1), $f(0) = \int_0^\infty \frac{1}{\theta}\,dP(\theta)$. Then 

r r 1 1 1 r r 2 

n 

Note that 



If 1 ! 
/ - B dP{9)>M n 




/ fl dP ( fl )+/ 


= n 


Jo V 




JO w J2M~ 



I 



dP(0) > M n 



[ \dP{6) < M n /2 f dP{9) < M n /2 



□ 



Thus the set {P, f* Mn ' 6- x dP{6) > M n /2} contains 7£ and 
nra < n 



/ -dP(0)>M n /2 
Jo 6 



< 2M~ 1 E 



2 AC 



ndP(O) 



using Markov inequality. Then for a Type 1 prior, when $n$ is large enough, 



nfj- c i < 2M~ 



2M, 



-a{9)d9 



Jo * + l 



For a Type 2 prior, we have that 



/ oo 



min Xj < M n 1 



< [J2kQ(K = k))a([0,M- 1 ] 



< 



(jl e -cn 1/3 log(n) 2/3 



□ 



A.4. Proof of Lemma 11 

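Proof. Let $0 = a_0 < \dots < a_k = 1$ such that 

$$ \int_{a_0}^{a_1} dP = \int_{a_1}^{a_2} dP = \dots = \int_{a_{k-1}}^{a_k} dP = \sqrt{\varepsilon}/2 $$

then 

$$ \sum_{i=0}^{k-1} \int_{a_i}^{a_{i+1}} dP = k\sqrt{\varepsilon}/2 = 1 $$

thus $k = 2/\sqrt{\varepsilon}$. The set of functions $\{\mathbb{1}_{[a_i, a_{i+1}]}\}_i$ forms an $\varepsilon$-bracket as 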

y(I[a,,a s + 1 ]-]I[a s + 1 ,a s + 2 ]) 2 ^< 



which leads to the result 
$$ N_{[\,]}(\varepsilon, \mathcal{G}, L_2) \le k \le C \varepsilon^{-1/2} $$



□ 



Appendix B: Adaptation of Theorem 4 of Rivoirard and Rousseau 
(2009) 

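Theorem 12. Let $f_0$ be the true density and let $\Pi$ be a prior on $\mathcal{F}$ satisfying the 
following conditions: there exists a sequence $(\varepsilon_n)$ such that $\varepsilon_n \to 0$ and $n\varepsilon_n^2 \to \infty$ 
and a constant $c > 0$ such that for any $n$ there exists $\mathcal{F}_n \subset \mathcal{F}$ satisfying 

$$ \Pi(\mathcal{F}_n^c) = o(\exp(-(c + 2) n \varepsilon_n^2)). $$

For any $j \in \mathbb{N}^*$ let $\mathcal{F}_{n,j} = \{f \in \mathcal{F}_n,\ j\varepsilon_n < d(f, f_0) \le (j + 1)\varepsilon_n\}$ and $H_{n,j}$ the 
Hellinger (or $L_1$) metric entropy of $\mathcal{F}_{n,j}$. There exists a $J_{0,n}$ such that for all 
$j \ge J_{0,n}$ 

$$ H_{n,j} \le (K - 1) n \varepsilon_n^2 j^2 $$

where $K$ is an absolute constant. 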

Let $S_n$ be defined as in (11), thus if 

U(S n (e n ,e n )) > exp(-cne I 2 l log(n)) 

we have: 

$$ \Pi(f : d(f_0, f) \le J_{0,n} \varepsilon_n \mid X^n) = 1 + o_P(1) $$



The proof of this Theorem is similar to the one proposed by Rivoirard and Rousseau 
(2009), using an adaptation of Theorem 2.1 of Khazaei et al. (2010), to handle 
the fact that the Kullback-Leibler property is not satisfied. 



